The Vector Space Model in Information Retrieval - Term Weighting Problem

Author

  • Nicola Polettini
Abstract

Many traditional information retrieval (IR) tasks, such as text search, text clustering, or text categorization, have natural language documents as their first-class objects, in the sense that the algorithms meant to solve these tasks require explicit internal representations of the documents they deal with. In IR, documents are usually given an extensional vector representation, in which the dimensions (features) of the vector representing a document are the terms occurring in the document. The approach to term representation that the IR community has almost universally adopted is known as the bag-of-words approach: a document $d_j$ is represented as a vector of term weights $\vec{d}_j = \langle \omega_{1j}, \ldots, \omega_{rj} \rangle$, where $r$ is the cardinality of the dictionary and $0 \le \omega_{kj} \le 1$ represents the contribution of term $t_k$ to the specification of the semantics of $d_j$. This article analyses and compares many different bag-of-words approaches.
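As a rough illustration of the representation described in the abstract, the sketch below builds a dictionary from a tiny corpus and maps each document to a vector of weights in the range 0 to 1. The weighting used here (raw term count divided by the largest count in the document) is only one simple assumption chosen for the example, not a scheme taken from the article; the function names `tokenize`, `build_dictionary`, and `weight_vector` and the two sample documents are likewise hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real systems usually also strip
    # punctuation, remove stop words, and apply stemming.
    return text.lower().split()


def build_dictionary(docs):
    # The dictionary is the sorted set of all terms in the corpus;
    # its size plays the role of r in the abstract.
    return sorted({term for doc in docs for term in tokenize(doc)})


def weight_vector(doc, dictionary):
    # Assumed weighting: term count divided by the maximum term count
    # in the document, which keeps every weight between 0 and 1.
    counts = Counter(tokenize(doc))
    max_count = max(counts.values(), default=0)
    if max_count == 0:
        return [0.0] * len(dictionary)
    return [counts[term] / max_count for term in dictionary]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
dictionary = build_dictionary(docs)
vectors = [weight_vector(doc, dictionary) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Each position in a vector corresponds to one dictionary term, so terms absent from a document receive weight 0. Word order is discarded entirely, which is what the name bag-of-words refers to.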


Similar resources

Design of Thesis Topic Search Engine with Information Retrieval and Vector Space Model of TF-IDF Weighting

The development of internet makes improvement in relevant information needs. A way to get relevant information in internet is by using search engine application. Search engine application is a form of information retrieval system. Thesis searching is a problem that students face in their final study. A way to help them solve their problem is by using search engine, especially search engine that...


Document weight Query weight Top ten Scheme name

The goal in information retrieval is to enable users to automatically and accurately retrieve data relevant to their queries. One possible approach to this problem is to use the vector space model, which models documents and queries as vectors in the term space. The components of the vectors are determined by the term weighting scheme. This paper compared between a selected set from the availab...


Problem 4 : Term Weighting Schemes in Information Retrieval

Information retrieval is the process of evaluating a user's query, or information need, against a set of documents (books, journal articles, web pages, etc.) to determine which of the documents satisfies the query. With the advent of the World Wide Web, there is suddenly a need to query enormous sets of documents both efficiently and accurately. In the vector space model of information retrieval, ...


A Learning-Based Term-Weighting Approach for Information Retrieval

One of the core components in information retrieval (IR) is the document-term-weighting scheme. In this paper, we will propose a novel learning-based term-weighting approach to improve the retrieval performance of vector space model in homogeneous collections. We first introduce a simple learning system to weight the index terms of documents. Then, we deduce a formal computational approach acc...


Relating the new language models of information retrieval to the traditional retrieval models

During the last two years, exciting new approaches to information retrieval were introduced by a number of different research groups that use statistical language models for retrieval. This paper relates the retrieval algorithms suggested by these approaches to widely accepted retrieval algorithms developed within three traditional models of information retrieval: the Boolean model, the vector ...


Beyond TFIDF Weighting for Text Categorization in the Vector Space Model

KNN and SVM are two machine learning approaches to Text Categorization (TC) based on the Vector Space Model. In this model, borrowed from Information Retrieval, documents are represented as a vector where each component is associated with a particular word from the vocabulary. Traditionally, each component value is assigned using the information retrieval TFIDF measure. While this weighting met...




Publication date: 2004